Experimental Design

In 2021, DNA was extracted from 197 isolates of Ascochyta rabiei and sent for Whole-Genome-Sequencing (WGS) at the Australian Genome Research Facility (AGRF, Melbourne) on 1 lane of a NovaSeq flowcell, producing 150 bp paired-end reads (run name CAGRF20114478).
Details of the sequenced isolates is provided in (Table 1).

datatable(samples_table, caption=tbls("samples"), rownames = FALSE) %>% # , 
          # options = list(dom = 'tf', pageLength = 40)) %>%
  formatStyle('Path_rating',
  backgroundColor = styleInterval(0:4, c('limegreen','gold', 'orange', 'orangered', 'firebrick', 'darkred'))
)# pander , justify="left"

Aims

  • Identify strain-unique variants to develop detection methods
  • Associate aggressiveness with specific variants

Analysis Pipeline

General overview:

  1. Data pre-processing:
    1. Quality check
    2. Adaptor trimming
    3. Post-trim quality check
  2. Mapping reads to a reference genome
  3. Reads deduplication
  4. Variant calling and filtration
  5. Variant annotation (including assigning SSR haplotypes)
  6. Variant-Pathogenicity association
  7. Produce variant statistics and assessment

Methods

DNA-Seq data processing, mapping and variant calling were performed on the QCIF Awoonga HPC Cluster (using PBSPro scheduler), using Snippy v4.6.0, a rapid haploid variant calling and core genome alignment. The pipeline uses FreeBayes v1.3.5 [@garrison_haplotype-based_2012] and other tools to assign variant probability scores and call variants.

Consider the following options for file download and processing:

  1. Download and perform all processing on the compute node $TMPDIR (limited to 20GB), then cleanup (using --cleanup in Snippy) and copy just the important results (fastq.genozip files, QC reports and Snippy results) to scratch folder (30days or 90days on Awoonga) and into CloudStor (with rclone)
  2. Download and perform all processing on scratch folder (30days or 90days on Awoonga), then cleanup and upload into CloudStor.

At the current pipeline, all processing is done at the the read files were downloaded and then following a standard naming conventions (to start from a pair of SampleID_R#.fastq.gz file), to make it much easier to process all files with the same script, using parameters to specify batch name, read length, sequencing platform and other potential variables.
Detailed methods, including code for running each of the analyses steps are provided in the associated A_rabiei_WGS_analysis GitHub repository.

Data pre-processing

Install needed software in a conda environment on the new Gowonda HPC cluster.

# download conda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# install conda
bash Miniconda3-latest-Linux-x86_64.sh 
# initialise conda
source ~/.bashrc
# add channels and set priorities
conda config --add channels conda-forge
conda config --append channels bioconda
# install extra packages to the base environment
conda install -n base libgcc gnutls libuuid readline cmake git tmux libgfortran parallel \
  gawk pigz rename genozip autoconf sshpass 
# install snippy (need to fix internet connection to gowonda2 - use patched netcheck in ~/bin)
# source ~/.proxy
CONDA_NAME=snippy
conda create -n $CONDA_NAME snippy sra-tools bcbio-gff libgd xorg-libxpm bdeops \
                libpng libjpeg-turbo jpeg snpsift rename biobambam bwa-mem2 sambamba \
                libtiff genozip parallel qualimap multiqc bbmap fastp genozip
# Clean extra space
# conda update -n base conda
conda clean -y --all
# cpanm git://github.com/IdoBar/XML-DOM-XPath-0.14@patch1
# cpanm --force Bio::SeqIO
# Install pdfx to parse the report and download the files, see https://stackoverflow.com/a/33173484
pip install pdfx
REF_DIR=$HOME/data/reference_genomes/Ascochyta_reference_genomes # on Griffith HPC
# REF_DIR=$HOME/data/reference_genomes # on Awoonga HPC
WORK_DIR=$HOME/scratch/A_rabiei_WGS_2021 # on Griffith HPC
# WORK_DIR=$HOME/90days/data/A_rabiei_WGS_2021 # on Awoonga
# download reference genomes
source ~/.proxy # on Griffith HPC
wget -c ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/011/695/GCF_004011695.1_Arabiei_Me14/GCF_004011695.1_Arabiei_Me14_genomic.fna.gz -O $REF_DIR/GCF_004011695.1_Arabiei_Me14_genomic.fna.gz
wget -c ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/011/695/GCF_004011695.1_Arabiei_Me14/GCF_004011695.1_Arabiei_Me14_genomic.gff.gz -O $REF_DIR/GCF_004011695.1_Arabiei_Me14_genomic.gff.gz

# convert gff3 (from AUGUSTUS) and fasta files to genbank format
GENOME="$REF_DIR/GCF_004011695.1_Arabiei_Me14_genomic"
ls -1 $GENOME*.gz | parallel pigz -d {}
ln -s $GENOME.fna $GENOME.fa
genozip --make-reference $GENOME.fa
# cd $HOME/scratch/data/reference_genomes/Ascochyta_reference_genomes/
# download conversion script
# wget https://raw.githubusercontent.com/chapmanb/bcbb/master/gff/Scripts/gff/gff_to_genbank.py
# python gff_to_genbank.py $GENOME.gff3 $GENOME.fasta

Download the list of raw .fastq.gz files from AGRF and upload to CloudStor.

mkdir -p $WORK_DIR
cd !$
SSHUSER=NiloofarVaghefi1
SSHPASS=Ashtad3721
AGRF_DIR=files/AGRF_CAGRF20114478_H3HGFDSX2
# get the list of files from AGRF - doesn't work on Griffith HPC because of proxy settings - try using corkscrew (http://wiki.kartbuilding.net/index.php/Corkscrew_-_ssh_over_https)
sshpass -p $SSHPASS ssh $SSHUSER@agrf-data.agrf.org.au "ls -1 $AGRF_DIR" > AGRF_CAGRF20114478_H3HGFDSX2_files.list
# download md5 file
sshpass -p $SSHPASS rsync -avzh $SSHUSER@agrf-data.agrf.org.au:$AGRF_DIR/checksums.md5 ./
# rsync NiloofarVaghefi1@agrf-data.agrf.org.au:files/AGRF_CAGRF20114478_H3HGFDSX2/'*'
#rsync -avh -e ssh IdoBar1@agrf-data.agrf.org.au:files/AGRF_CAGRF19461_HGMLMAFXY ~/scratch/data/A_rabiei_WGS/
# download 2019 sequencing
# AGRF
rclone copy -P --max-depth 1  CloudStor:Shared/GRIFFITH-Ford-Crop-Genetics-Lab/A_rabiei_WGS/AGRF_CAGRF19461_HGMLMAFXY/AGRF_BT2_process_04_04_2019 AGRF_CAGRF19461 --include AGRF*.fastq.gz
# AgVic
rclone  copy -P --max-depth 1  CloudStor:Shared/GRIFFITH-Ford-Crop-Genetics-Lab/A_rabiei_WGS/AgVic_WGS2018 AgVic_WGS20
18
# Macrogen
rclone copy -P --max-depth 1  CloudStor:Shared/GRIFFITH-Ford-Crop-Genetics-Lab/A_rabiei_WGS/Macrogen_sequences_1702KHP-0
164 Macrogen_sequences_1702KHP-0164

Adaptor Trimming

Adaptors needed to be removed, as well as very low quality bases/reads, so trimming was performed with fastp v0.20.1, which also produced QC figures and reports.
Raw files quality after adaptor trimming was assessed with FastQC v0.11.8.

Mapping to the reference genome and calling variants (using Snippy)

The trimmed reads were mapped to the A. rabiei reference genome, strain Me14 (GCA_004011695.1) using Snippy v4.6.0. Snippy is a wrapper that makes use of popular bioinformatics tools, such as bwa-mem v0.7.17-r1188 to map the reads to the reference genome, followed by several commands of samtools v1.12 to specify a Read Group for each sample (provided to Snippy with the --rgid flag), mark duplicates and convert the alignments into coordinate-sorted, indexed BAM files.
The alignment files are then processed by Freebayes v1.3.5 to call variants from all samples and bcftools v1.12 and snpEff v5.0 to filter and annotate the variants and retain only high-quality variants (based on minimum depth and genotype quality thresholds) that are common to all samples and referred to as core SNPs.
Mapping statistics were obtained with Qualimap v.2.2.2-dev [@okonechnikov_qualimap_2016] and consolidated along with pre and post-trimming QC measures into a single, interactive report for each batch using MultiQC v1.10.1 [@ewels_multiqc:_2016].

Processing of the files was done on Awoonga/Griffith HPC clusters

# work on Awoonga (because Griffith HPC can't download the fastq files directly...)
CONDA_NAME=/export/home/s2978925/miniconda3/envs/snippy
conda activate $CONDA_NAME
BBMAP_REF=$(find $CONDA_PREFIX -wholename "*resources/adapters.fa")
# Prepare the BBduk commands
DATE=`date +%d_%m_%Y`
BATCH=AGRF
RUN="${BATCH}_snippy_${DATE}" # day of run was 02_02_2019
RUN_DIR=$WORK_DIR/${RUN}
mkdir -p $RUN_DIR
cd $RUN_DIR
# GENOME="$HOME/scratch/data/reference_genomes/Ascochyta_reference_genomes/A.rabiei_me14"
# READ_LENGTH=150
# PLOIDY=1
NCORES=12
RAM=28
# LANES=$( ls -1 ../*.fastq.gz | egrep -o "_L[0-9]+?_" | sort | uniq | wc -l )
RGPM=NovaSeq
RGPL=ILLUMINA
RGPU=H3HGFDSX2
RGCN=AGRF
COMMON_ID="DO16094321_HFKYMALXX"
case $RGPM in
    NextSeq )
        CLUMP_PARAM="dupedist=40 spany" ;;
    HiSeq2500 )
        CLUMP_PARAM="dupedist=40" ;;
    HiSeq3000 )
        CLUMP_PARAM="dupedist=2500" ;;
    HiSeq4000 )
        CLUMP_PARAM="dupedist=2500" ;;
    NovaSeq )
        CLUMP_PARAM="dupedist=12000" ;;
    * )
        CLUMP_PARAM="" ;;
esac
# where to store read files
# FQ_DIR="$RUN_DIR/fastq_files"
mkdir -p $RUN_DIR/CloudStor_copy
# Prepare a general array PBS script
echo '#!/bin/bash 
#PBS -V

cd $WORKDIR
>&2 echo Current working directory: $PWD
source ~/.bashrc'"
conda activate $CONDA_PREFIX 
set -Eeo pipefail
gawk -v ARRAY_IND=\$PBS_ARRAY_INDEX 'NR==ARRAY_IND' \$CMDS_FILE | bash" > ${WORK_DIR}/array.pbspro

# Then perform adapter trimming, mapping and variant calling (snippy); decide which intermediate files can be removed
# grep "_R1.fastq.gz" $WORK_DIR/AGRF_CAGRF20114478_H3HGFDSX2_files.list | sort | parallel -k --dry-run --rpl "{infile} s:_R1:_R#:; uq()" --rpl "{files} s:_R1:_R?:; uq()" --rpl "{sample} s:_H3HGFDSX2_[ACGT\-]+_L002_.+::"  "rclone copy CloudStor:Shared/GRIFFITH-Ford-Crop-Genetics-Lab/A_rabiei_WGS/AGRF_CAGRF20114478_H3HGFDSX2/AGRF_CAGRF20114478/ --include \"{files}\" ./{sample} ;  if [[ \$( md5sum {sample}/{files} | cut -d' ' -f1 ) == \$( egrep '{sample}_' $WORK_DIR/checksums.md5 | cut -d' ' -f1 )  ]]; then echo 'Files passed md5 check'; else exit 1 ; fi ; java -ea -XX:ParallelGCThreads=2 -Xmx4g -Xms4g -cp /home/ibar/.pyenv/versions/miniconda-latest/envs/snippy/opt/bbmap-38.90-1/current/ jgi.BBDuk ref=$BBMAP_REF ktrim=r k=23 mink=11 hdist=1 qtrim=rl trimq=10 tpe tbo minlen=30 ziplevel=8 threads=\$[NCPUS - 2] ow in={sample}/{infile} out={sample}/{sample}_R#.trimmed.fastq.gz stats={sample}.stats ow ; mkdir -p trimmed_reads/{sample}_QC ; fastqc -t \$[NCPUS - 2] -o trimmed_reads/{sample}_QC {sample}/{sample}*.trimmed.fastq.gz ; snippy --force --cleanup --cpus \$[NCPUS - 2] --ram $[RAM-2] --outdir snippy_{sample} --report --rgid {sample} --ref $GENOME.fna --R1 {sample}/{sample}_R1.trimmed.fastq.gz --R2 {sample}/{sample}_R2.trimmed.fastq.gz; rm -r {sample}/ " > $RUN_DIR/$RUN.cmds


# Then perform adapter trimming (fastp), mapping and variant calling (snippy); decide which intermediate files can be removed
# grep "_R1.fastq.gz" $WORK_DIR/AGRF_CAGRF20114478_H3HGFDSX2_files.list | sort | parallel --dry-run --rpl "{file2} s:_R1:_R2:; uq()" --rpl "{files} s:_R1:_R?:; uq()" --rpl "{sample} s:_H3HGFDSX2_[ACGT\-]+_L002_.+::"  "rclone copy CloudStor:Shared/GRIFFITH-Ford-Crop-Genetics-Lab/A_rabiei_WGS/AGRF_CAGRF20114478_H3HGFDSX2/AGRF_CAGRF20114478/ --include \"{files}\" ./{sample} ; cd {sample};  if [[ \$( egrep '{sample}_H3H' $WORK_DIR/checksums.md5 | md5sum -c | cut -f2 -d ' ' | uniq) == 'OK'  ]]; then >&2 echo 'Files passed md5 check'; else >&2 echo 'Files did not pass md5 check, please download again'; exit 99 ; fi ; fastp -i {} -I {file2} -c --adapter_fasta $BBMAP_REF -l 30 -p -w \$[NCPUS - 2] -z 7 -o {sample}_R1.trimmed.fastq.gz -O  {sample}_R2.trimmed.fastq.gz -h {sample}.fastp.html -j {sample}.fastp.json; fastqc -t \$[NCPUS - 2] -o {sample}_QC {sample}*.trimmed.fastq.gz ; snippy --force --cleanup --cpus \$[NCPUS - 2] --ram $[RAM-2] --outdir snippy_{sample} --report --rgid {sample} --ref $GENOME.fna --R1 {sample}_R1.trimmed.fastq.gz --R2 {sample}_R2.trimmed.fastq.gz" > $RUN_DIR/$RUN.cmds


# On Griffith HPC: perform adapter trimming (fastp), mapping and variant calling (snippy); decide which intermediate files can be removed
grep "_R1.fastq.gz" $WORK_DIR/AGRF_CAGRF20114478_H3HGFDSX2_files.list | sort | parallel -k --dry-run --rpl "{file2} s:_R1:_R2:; uq()" --rpl "{files} s:_R1:_R?:; uq()" --rpl "{sample} s:_H3HGFDSX2_[ACGT\-]+_L002_.+::"  "mkdir -p ./{sample} ;cd  {sample}; ln -sf $WORK_DIR/AGRF_CAGRF20114478/{files} ./;  if [[ \$( egrep '{sample}_H3H' $WORK_DIR/checksums.md5 | md5sum -c | cut -f2 -d ' ' | uniq) == 'OK'  ]]; then >&2 echo 'Files passed md5 check'; else >&2 echo 'Files did not pass md5 check, please download again'; exit 99 ; fi ; fastp -i {} -I {file2} -c --adapter_fasta $BBMAP_REF -l 30 -p -w \$[NCPUS - 2] -z 7 -o {sample}_R1.trimmed.fastq.gz -O {sample}_R2.trimmed.fastq.gz -h {sample}.fastp.html -j {sample}.fastp.json; snippy --force --cleanup --cpus \$[NCPUS - 2] --ram $[RAM-2] --outdir snippy_{sample} --report --rgid {sample} --ref $GENOME.fna --R1 {sample}_R1.trimmed.fastq.gz --R2 {sample}_R2.trimmed.fastq.gz " > $RUN_DIR/$RUN.cmds

# run just fastp trim
grep "_R1.fastq.gz" $WORK_DIR/AGRF_CAGRF20114478_H3HGFDSX2_files.list | sort | parallel -k --dry-run --rpl "{file2} s:_R1:_R2:; uq()" --rpl "{files} s:_R1:_R?:; uq()" --rpl "{sample} s:_H3HGFDSX2_[ACGT\-]+_L002_.+::"  "mkdir -p ./{sample} ;cd  {sample}; ln -sf $WORK_DIR/AGRF_CAGRF20114478/{files} ./;  if [[ \$( egrep '{sample}_H3H' $WORK_DIR/checksums.md5 | md5sum -c | cut -f2 -d ' ' | uniq) == 'OK'  ]]; then >&2 echo 'Files passed md5 check'; else >&2 echo 'Files did not pass md5 check, please download again'; exit 99 ; fi ; fastp -i {} -I {file2} -c --adapter_fasta $BBMAP_REF -l 30 -p -w \$[NCPUS - 2] -z 7 -o {sample}_R1.trimmed.fastq.gz -O {sample}_R2.trimmed.fastq.gz -h {sample}.fastp.html -j {sample}.fastp.json" > $RUN_DIR/fastp_trim.cmds


# find . -maxdepth 1 -name "*_R1.fastq.gz" | parallel --dry-run --rpl "{files} s:_R1:_R?:; uq()" --rpl "{sample} s:_H3HGFDSX2_[ACGT\-]+_L002_.+::"  "genozip {files} -f -@ \$[NCPUS - 2] -2 -E $GENOME.ref.genozip -o $RUN_DIR/CloudStor_copy/{sample}_H3HGFDSX2.fq.genozip; rm {files} trimmed_reads/{sample}_R?.recal.trimmed.fastq.gz" > genozip_clean.cmds

# GENOZIP_CLEAN=$(qsub -J1-$(cat $RUN_DIR/genozip_clean.cmds | wc -l) -l select=1:ncpus=6:mem=12GB,walltime=1:00:00 -N genozip_clean -A qris-gu -v "CMDS_FILE=$RUN_DIR/genozip_clean.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# recalibrate requires a reference 
# bbduk.sh in=stdin.fq out=stdout.fq int recalibrate | filterbytile.sh in={infile} out=stdout.fq int=f | clumpify.sh in=stdin.fq out=stdout.fq int dedupe usetmpdir optical $CLUMP_PARAM adjacent 

# add --cleanup to snippy
# make a folder for the log files
mkdir -p $RUN_DIR/pbs_logs
cd !$

# run fastp trim
FASTP_TRIM=$(qsub -J1-$(cat $RUN_DIR/fastp_trim.cmds | wc -l) -l select=1:ncpus=6:mem=8GB,walltime=1:00:00 -N fastp_trim -A qris-gu -v "CMDS_FILE=$RUN_DIR/fastp_trim.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")

# save failed commands to file
grep "job killed"  fastp_trim.e$FASTP_TRIM.* | awk -F [.:] '{print $3}' | awk 'NR==FNR{pos[$1]; next}FNR in pos' - $RUN_DIR/fastp_trim.cmds > $RUN_DIR/fastp_trim.cmds.$FASTP_TRIM.failed
# rerun failed jobs
cd pbs_logs
FASTP_TRIM=$(qsub -J1-$(cat $RUN_DIR/fastp_trim.cmds.$FASTP_TRIM.failed | wc -l) -l select=1:ncpus=8:mem=4GB,walltime=2:00:00 -N fastp_trim -A qris-gu -v "CMDS_FILE=$RUN_DIR/fastp_trim.cmds.$FASTP_TRIM.failed","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")

# test run
TEST_SNIPPY=$(qsub -J195-196 -l select=1:ncpus=${NCORES}:mem=${RAM}GB,walltime=5:00:00 -N ${RUN:0:11} -A qris-gu -v "CMDS_FILE=$RUN_DIR/$RUN.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")

# run for all files
STEP_SIZE=10

TRIM_SNIPPY=$(qsub -J1-$(cat $RUN_DIR/$RUN.cmds | wc -l) -l select=1:ncpus=${NCORES}:mem=${RAM}GB,walltime=5:00:00 -N ${RUN:0:11} -A qris-gu -v "CMDS_FILE=$RUN_DIR/$RUN.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
#TRIM_SNIPPY2=$(qsub -J2-$(cat $RUN_DIR/$RUN.cmds | wc -l):$STEP_SIZE -l select=1:ncpus=${NCORES}:mem=${RAM}GB,walltime=1:00:00 -N ${RUN:0:11} -W depend=afterok:$TRIM_SNIPPY[] -A qris-gu -v "CMDS_FILE=$RUN_DIR/$RUN.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")

# check for successful commands and save them to a file
cd $RUN_DIR
egrep "ExitStatus:[0]" pbs_logs/*.e$TRIM_SNIPPY.* | awk -F [.:] '{print $3}' | awk 'NR==FNR{pos[$1]; next}FNR in pos' - $RUN.cmds > $RUN.$TRIM_SNIPPY.succeed
# check for failed commands and save them to a file
egrep "ExitStatus:[^0]" pbs_logs/*.e$TRIM_SNIPPY.* | awk -F [.:] '{print $3}' | awk 'NR==FNR{pos[$1]; next}FNR in pos' - $RUN.cmds > $RUN.$TRIM_SNIPPY.failed

# rerun failed commands (modify pbs file if needed to increase resources)
cd pbs_logs
TRIM_SNIPPY2=$(qsub -J1-$(cat $RUN_DIR/$RUN.$TRIM_SNIPPY.failed | wc -l) -l select=1:ncpus=${NCORES}:mem=${RAM}GB,walltime=5:00:00 -N ${RUN:0:11} -A qris-gu -v "CMDS_FILE=$RUN_DIR/$RUN.$TRIM_SNIPPY.failed","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")

# repeat as needed
egrep "ExitStatus:[^0]" pbs_logs/*.e$TRIM_SNIPPY2.* | awk -F [.:] '{print $3}' | awk 'NR==FNR{pos[$1]; next}FNR in pos' - $RUN.$TRIM_SNIPPY.failed > $RUN.$TRIM_SNIPPY2.failed
# rerun failed jobs
cd pbs_logs
TRIM_SNIPPY3=$(qsub -J1-$(cat $RUN_DIR/$RUN.$TRIM_SNIPPY2.failed | wc -l) -l select=1:ncpus=${NCORES}:mem=${RAM}GB,walltime=5:00:00 -N ${RUN:0:11} -A qris-gu -v "CMDS_FILE=$RUN_DIR/$RUN.$TRIM_SNIPPY2.failed","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")

# repeat as needed
egrep "ExitStatus:[^0]" pbs_logs/*.e$TRIM_SNIPPY3.* | awk -F [.:] '{print $3}' | awk 'NR==FNR{pos[$1]; next}FNR in pos' - $RUN.$TRIM_SNIPPY2.failed > $RUN.$TRIM_SNIPPY3.failed
# rerun failed jobs
cd pbs_logs
TRIM_SNIPPY_NEW=$(qsub -J1-$(cat $RUN_DIR/$NEW_RUN.bash | wc -l) -l select=1:ncpus=${NCORES}:mem=${RAM}GB,walltime=7:00:00 -N ${RUN:0:11} -A qris-gu -v "CMDS_FILE=$RUN_DIR/$NEW_RUN.bash","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")

# find incomplete jobs
cd $RUN_DIR
find . -name "snps.aligned.fa" | cut -f 3 -d / | gawk '
{printf "%s|", $1}' | sed 's/|$//' |  egrep -vf - $RUN_DIR/$RUN.cmds > $RUN_DIR/$RUN.cmds.failed
# rerun failed jobs
cd pbs_logs
TRIM_SNIPPY3=$(qsub -J1-$(cat $RUN_DIR/$RUN.cmds.failed | wc -l) -l select=1:ncpus=${NCORES}:mem=32GB,walltime=2:00:00 -N ${RUN:0:11} -v "CMDS_FILE=$RUN_DIR/$RUN.cmds.failed","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")

# find duplicate sample names that didn't work initially
egrep "AR0052_H3H|AR0039_H3H|AR0242_H3H" $RUN_DIR/$RUN.cmds > $RUN_DIR/$RUN.cmds.retry
TRIM_SNIPPY_RETRY=$(qsub -J1-$(cat $RUN_DIR/$RUN.cmds.retry | wc -l) -l select=1:ncpus=${NCORES}:mem=32GB,walltime=2:00:00 -N ${RUN:0:11} -v "CMDS_FILE=$RUN_DIR/$RUN.cmds.retry","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")

# remove empty log files
find . -type f -size 0 -name "${RUN:0:11}.*" -exec rm {} +

# Run snippy-core on the output files
cd $RUN_DIR
CORE_JOB=snippy_core_"$(date +%d_%m_%Y)"
SNPY_FOLDS=$( find . -type d -name "snippy_AR*" | xargs )
SNPY_REF="'$( find . -type d -name "snippy_AR*" | head -n1 )/reference/ref.fa'"

# CORE_CMD=$( tail -n1 ${RUN}.bash )
SNPY_CORE=$( echo "cd $( pwd ) ; source ~/.bashrc ; conda activate $CONDA_PREFIX ; snippy-core --prefix=CORE_JOB --ref $SNPY_REF $SNPY_FOLDS " | qsub -V -l select=1:ncpus=${NCORES}:mem=${RAM}GB,walltime=2:00:00  -N ${CORE_JOB:0:11} | egrep -o "[0-9]{7}" ) # 5248661.pbsserver


# cd $RUN_DIR
# find . -maxdepth 1 -type d -name "AR*" | cut -f2 -d"/" | parallel --dry-run "mkdir -p {}/{}_QC; cd {};  fastp -i {}*_R1.trimmed.fastq.gz -I {}*_R2.trimmed.fastq.gz -w $[NCPUS - 2] -h {}_QC/{}.fastp.html -j  {}_QC/{}.fastp.json -A -Q -G -m --stdout" > fastp_trim_qc.cmds

# FASTP_QC=$(qsub -J1-$(cat $RUN_DIR/fastp_trim_qc.cmds | wc -l) -l select=1:ncpus=${NCORES}:mem=${RAM}GB,walltime=5:00:00 -N ${RUN:0:11} -A qris-gu -v "CMDS_FILE=$RUN_DIR/fastp_trim_qc.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# record the array ID: 5247580[] Macrogen batch on Gowonda

# multiqc report
MULTIQC_JOB="${RUN}_QC"
echo "cd $( pwd ) ; multiqc -i $MULTIQC_JOB -o $MULTIQC_JOB ." | qsub -V -l select=1:ncpus=12:mem=4GB,walltime=1:00:00 -N ${MULTIQC_JOB:0:11} -W depend=afterok:$QUALIMAP_JOB_ID # 5248672.pbsserver

# collect trimming summary
grep "Result:" ${RUN}.log | gawk 'NR%2==0' | gawk 'BEGIN{per=0}{per+=gensub(/.([0-9]+\.[0-9]+).+/, "\\1", "1", $7); avg=per/NR}END{printf("Average percentage of bases kept after trimming: %.2f%% \n",avg)}' > $RUN.bbduk.stats
# Average percentage of bases kept after trimming: 97.76%
# Number of files processed: 72
grep "Input:" ${RUN}.log | gawk 'NR%2==0' | gawk 'BEGIN{reads=0};{reads+=$2; avg=reads/NR}END{printf("Average number of reads per file: %.2f \nTotal number of reads: %d\nNumber of files processed: %d\n",avg, reads,NR)}' >> $RUN.bbduk.stats
# Average number of reads per file: 4243748.83
# Total number of reads: 305549916
# find and remove empty files
find . -size 0 -exec rm {} + 
# Check that all jobs finished successfuly
find . -regextype posix-egrep -regex '\./.*\.e[0-9]{7}.*' | xargs grep "ExitStatus" #  *m.e$JOB_ID.*
# remove temp files
rm raw.sam raw_R*.fq.gz
# Done!

Mapping to the updated reference genome

Reads are mapped to the A. rabiei reference genome assembled and annotated by Fredrick Mobegi (the CCDM, Curtin University, add NCBI accession when available) using bwa-mem2 v2.2.1 [@vasimuddinEfficientArchitectureAwareAcceleration2019]. The alignment files were then coordinate-sorted and had PCR duplicates marked using bamsormadup from biobambam2 v2.0.182 [@tischlerBiobambamToolsRead2014].
To improve performance, temporary files were written to a local lscratch folder on the computing node.

REF_DIR=$HOME/data/reference_genomes/Ascochyta_reference_genomes/Updated_ME14_reference_and_annotations # on Griffith HPC
# REF_DIR=$HOME/data/reference_genomes # on Awoonga HPC
GENOME="$REF_DIR/ArME14"
ln -s $REF_DIR/Ascochyta_rabiei_ArME14.scaffolds.fa $GENOME.fa 
ln -s $REF_DIR/ArME14_all_annotations.updated.gff3 $GENOME.gff3 
gff2bed < $GENOME.gff3 > $GENOME.bed
bwa-mem2 index $GENOME.fa 
samtools faidx $GENOME.fa

Prepare and run the pipeline to trim the reads

WORK_DIR=$HOME/scratch/A_rabiei_WGS_2021
CONDA_NAME=/export/home/s2978925/miniconda3/envs/snippy
conda activate $CONDA_NAME
BBMAP_REF=$(find $CONDA_PREFIX -wholename "*resources/adapters.fa")
# Prepare the commands
DATE=`date +%d_%m_%Y`
BATCH=All_2018_2020
RUN="${BATCH}_${DATE}" # day of run was 02_02_2019
RUN_DIR=$WORK_DIR/${RUN}
mkdir -p $RUN_DIR
cd $RUN_DIR
# GENOME="$HOME/scratch/data/reference_genomes/Ascochyta_reference_genomes/A.rabiei_me14"
# READ_LENGTH=150
# PLOIDY=1
NCORES=12
RAM=28
# LANES=$( ls -1 ../*.fastq.gz | egrep -o "_L[0-9]+?_" | sort | uniq | wc -l )
RGPM=NovaSeq
RGPL=ILLUMINA
# RGPU=H3HGFDSX2
# RGCN=AGRF
# COMMON_ID="DO16094321_HFKYMALXX"

# where to store read files
# FQ_DIR="$RUN_DIR/fastq_files"
# mkdir -p $RUN_DIR/CloudStor_copy
# Prepare a general array PBS script
echo '#!/bin/bash 
#PBS -V

cd $WORKDIR
>&2 echo Current working directory: $PWD
source ~/.bashrc'"
conda activate $CONDA_PREFIX 
set -Eeo pipefail
gawk -v ARRAY_IND=\$PBS_ARRAY_INDEX 'NR==ARRAY_IND' \$CMDS_FILE | bash" > ${WORK_DIR}/array.pbspro

# Prepare a parallel array PBS script
echo '
ARRAYID=$( echo $PBS_JOBID | egrep -o "^[0-9]+" )
gawk -v ARRAY_IND=$PBS_ARRAY_INDEX -v step=$STEP_SIZE '"'NR>=ARRAY_IND && NR<(ARRAY_IND+step)'"' $CMDS_FILE | parallel -k --joblog $PBS_O_WORKDIR/${PBS_JOBNAME}.p$ARRAYID.$PBS_ARRAY_INDEX' | cat <(head -n -1  ${WORK_DIR}/array.pbspro) - > ${WORK_DIR}/parallel_array.pbspro

# On Griffith HPC: 
# link all fastq files and rename them the same way
# AgVic batch
ln -s $WORK_DIR/AgVic_WGS2018/*.fastq.gz ./
rename -v 's/S[0-9]+_L003_//' *.fastq.gz
# AGRF 2019
ln -s $WORK_DIR/AGRF_CAGRF19461/*.fastq.gz ./
# Macrogen
ln -s $WORK_DIR/Macrogen_sequences_1702KHP-0164/*.fastq.gz ./
# 2020 AGRF
ln -s $WORK_DIR/AGRF_CAGRF20114478/*.fastq.gz ./
rename -v 's/H3HGFDSX2_[ACGT-]+_L002_//' *.fastq.gz
# ignore contaminated/shallow files
# IGNORE_SAMS='11_R|5B_R|2_R|11_R|15CUR005_R|AGRF_18_R|AGRF_08_R|F15023_R|20_R'
printf "AR0106\nAR0107\nAR0113\nAR0118\n11\n5B\n2\n11\n15CUR005\nAGRF_18\nAGRF_08\nF15023\n20" > ignore_samples.txt
IGNORE_RE=$(awk '{printf "^%s_R|",$1}' ignore_samples.txt | sed 's/|$//')
# perform adapter trimming (fastp) 
ls -1 *_R1.fastq.gz  | egrep -v $IGNORE_RE | sort | parallel -k --dry-run --rpl "{file2} s:_R1:_R2:; uq()" --rpl "{files} s:_R1:_R?:; uq()" --rpl "{sample} s:_R1.+::"  "mkdir -p ./{sample} ;cd  {sample}; ln -sf $RUN_DIR/{files} ./; fastp -i {} -I {file2} -c --adapter_fasta $BBMAP_REF -l 30 -p -w \$[NCPUS - 2] -z 7 -o {sample}_R1.trimmed.fastq.gz -O {sample}_R2.trimmed.fastq.gz -h {sample}.fastp.html -j {sample}.fastp.json; " > $RUN_DIR/fastp_trim.cmds


# find . -maxdepth 1 -name "*_R1.fastq.gz" | parallel --dry-run --rpl "{files} s:_R1:_R?:; uq()" --rpl "{sample} s:_H3HGFDSX2_[ACGT\-]+_L002_.+::"  "genozip {files} -f -@ \$[NCPUS - 2] -2 -E $GENOME.ref.genozip -o $RUN_DIR/CloudStor_copy/{sample}_H3HGFDSX2.fq.genozip; rm {files} trimmed_reads/{sample}_R?.recal.trimmed.fastq.gz" > genozip_clean.cmds

# GENOZIP_CLEAN=$(qsub -J1-$(cat $RUN_DIR/genozip_clean.cmds | wc -l) -l select=1:ncpus=6:mem=12GB,walltime=1:00:00 -N genozip_clean -A qris-gu -v "CMDS_FILE=$RUN_DIR/genozip_clean.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# recalibrate requires a reference 
# bbduk.sh in=stdin.fq out=stdout.fq int recalibrate | filterbytile.sh in={infile} out=stdout.fq int=f | clumpify.sh in=stdin.fq out=stdout.fq int dedupe usetmpdir optical $CLUMP_PARAM adjacent 

# add --cleanup to snippy
# make a folder for the log files
mkdir -p $RUN_DIR/pbs_logs
cd !$

# run fastp trim
FASTP_TRIM=$(qsub -J1-$(cat $RUN_DIR/fastp_trim.cmds | wc -l) -l select=1:ncpus=8:mem=4GB,walltime=3:00:00 -N fastp_trim -v "CMDS_FILE=$RUN_DIR/fastp_trim.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")

# combine stderr files to log
cat *.e$FASTP_TRIM.* > fastp_trim.$FASTP_TRIM.log

# save failed commands to file
grep "job killed"  fastp_trim.e$FASTP_TRIM.* | awk -F [.:] '{print $3}' | awk 'NR==FNR{pos[$1]; next}FNR in pos' - $RUN_DIR/fastp_trim.cmds > $RUN_DIR/fastp_trim.cmds.$FASTP_TRIM.failed
# rerun failed jobs

# align the trimmed reads to the genome
cd $RUN_DIR
ls -1 *_R1.fastq.gz  | egrep -v $IGNORE_RE | sort | parallel -k --dry-run --rpl "{file2} s:_R1:_R2:; uq()" --rpl "{files} s:_R1:_R?:; uq()" --rpl "{sample} s:_R1.+::"  "cd  {sample}; bwa-mem2 mem -R \"@RG\tID:{sample}\tSM:{sample}\" -t \$[NCPUS - 2] $GENOME.fa {sample}_R1.trimmed.fastq.gz {sample}_R2.trimmed.fastq.gz | bamsormadup inputformat=sam threads=\$[NCPUS - 2] > {sample}.dedup.rg.csorted.bam" > $RUN_DIR/align_reads.cmds 

# test run
TEST_BWA=$(qsub -J1-2 -l select=1:ncpus=8:mem=8GB,walltime=2:00:00 -N align_reads -v "CMDS_FILE=$RUN_DIR/align_reads.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")

# run for all files
ALIGN_BWA=$(qsub -J3-$(cat $RUN_DIR/align_reads.cmds | wc -l) -l select=1:ncpus=8:mem=8GB,walltime=2:00:00 -N align_reads -v "CMDS_FILE=$RUN_DIR/align_reads.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")

STEP_SIZE=5 # the processing is very quick (<20 minutes) and it would be more efficient for each job to process 5 commands at a time

IGNORE_SAMS=$(awk '{printf "^%s\t|",$1}' $RUN_DIR/ignore_samples.txt | sed 's/|$//')
# merge sam files of duplicated samples and rename to isolate name
gawk -F "\t" '$3 ~ /Ascochyta rabiei/ && $6 ~ /Illumina/' $WORK_DIR/WGS_sequencing_information_summary.txt | egrep -v "$IGNORE_SAMS"  | tail -n+2 | sort -k2 | gawk  -F "\t" -v ORS="" 'BEGIN{cmd="picard MergeSamFiles USE_THREADING=true QUIET=true COMPRESSION_LEVEL=0 O=/dev/stdout"}NR==1{printf "%s I=%s/%s.dedup.rg.csorted.bam", cmd, $1, $1; isolate=$2; split($6, a , " "); next}isolate!=$2{printf " | picard AddOrReplaceReadGroups SO=coordinate I=/dev/stdin CREATE_INDEX=true ID=%s LB=%s PL=Illumina PU=%s SM=%s O=%s.csorted.bam\n%s I=%s/%s.dedup.rg.csorted.bam", isolate, isolate, a[2], isolate,  isolate, cmd, $1, $1; isolate=$2; split($6, a , " "); next}isolate==$2{printf " I=%s/%s.dedup.rg.csorted.bam", $1, $1}END{printf " | picard AddOrReplaceReadGroups SO=coordinate I=/dev/stdin CREATE_INDEX=true ID=%s LB=%s PL=Illumina PU=%s SM=%s O=%s.csorted.bam\n", isolate, isolate, a[2], isolate,  isolate}'  > $RUN_DIR/merge_bams.cmds


# test run (if not fast enough replace with sambamba merge and samtools addreplacerg )
TEST_MERGE=$(qsub -J1-2 -l select=1:ncpus=8:mem=16GB,walltime=2:00:00 -N merge_bams -v "CMDS_FILE=$RUN_DIR/merge_bams.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")

# run for all files
MERGE_BAMS=$(qsub -J3-$(cat $RUN_DIR/merge_bams.cmds | wc -l) -l select=1:ncpus=8:mem=16GB,walltime=1:30:00 -N merge_bams -v "CMDS_FILE=$RUN_DIR/merge_bams.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")

# find failed runs and repeat
cd $RUN_DIR
find . -maxdepth 1 -name "*.csorted.bam" -size 1M | cut -f 2 -d "/" | grep -f - merge_bams.cmds > merge_bams.cmds.failed
# repeat run
echo "cd $RUN_DIR ; source ~/.bashrc; conda activate $CONDA_NAME; cat merge_bams.cmds.failed | parallel" | qsub -l select=1:ncpus=8:mem=16GB,walltime=2:00:00 -N merge_bams  

# index BAMs (if not done previously while adding Read Groups)
BAM_INDEX=$( echo "cd $RUN_DIR ; source ~/.bashrc; conda activate $CONDA_NAME; ls -1 *.csorted.bam | parallel 'sambamba index -t4 {} '" | qsub -V -l select=1:ncpus=16:mem=16GB,walltime=2:00:00 -N index_bams | egrep -o "^[0-9]+")

# Run qualimap on bam files (files need to be sorted)
# QUALIMAP_JOB="qualimap_${RUN}"
# Qualimap bamqc
cd $RUN_DIR
ls -1 *_R1.fastq.gz  | egrep -v $IGNORE_RE | sort | parallel -k --dry-run --rpl "{file2} s:_R1:_R2:; uq()" --rpl "{files} s:_R1:_R?:; uq()" --rpl "{sample} s:_R1.+::"  "unset DISPLAY; cd  $RUN_DIR/{sample}; qualimap bamqc -bam {sample}.dedup.rg.csorted.bam --java-mem-size=4G -c -gff $GENOME.bed -outdir $RUN_DIR/{sample}/{sample}_bamqc" > qualimap_bamqc.cmds
cd $RUN_DIR/pbs_logs
# submit to the cluster
STEP_SIZE=5 # the processing is very quick (<20 minutes) and it would be more efficient for each job to process 5 commands at a time
QUALIMAP=$(qsub -J1-$(cat $RUN_DIR/qualimap_bamqc.cmds | wc -l):$STEP_SIZE -l select=1:ncpus=8:mem=16GB,walltime=2:00:00 -N qualimap_bamqc -v "CMDS_FILE=$RUN_DIR/qualimap_bamqc.cmds","WORKDIR=$RUN_DIR","STEP_SIZE=$STEP_SIZE" ${WORK_DIR}/parallel_array.pbspro | egrep -o "^[0-9]+")


# Multi-bamqc: create/copy a file describing the sample names (sample_info.txt) to the parent folder and use it in qualimap 
# QUALIMAP=$( echo "cd $RUN_DIR ; source ~/.bashrc; conda activate $CONDA_NAME; unset DISPLAY ; ls -1 *.csorted.bam | gawk '{sample_name=gensub(/.csorted.bam/, \"\", \$1); printf \"%s\t%s\n\", sample_name, \$1}'  > $QUALIMAP_JOB.samples ; qualimap multi-bamqc --java-mem-size=4G -r -c -gff $GENOME.bed -d $QUALIMAP_JOB.samples -outformat PDF:HTML -outdir $QUALIMAP_JOB " | qsub -V -l select=1:ncpus=${NCORES}:mem=32GB,walltime=5:00:00 -N ${QUALIMAP_JOB:0:11} | egrep -o "^[0-9]+")

# remove empty log files
find . -type f -size 0 -name "${RUN:0:11}.*" -exec rm {} +

# multiqc report
MULTIQC_JOB=QC_$(date +%d_%m_%Y)
echo "source ~/.bashrc; conda activate $CONDA_NAME; cd $RUN_DIR; set -Eeo pipefail; multiqc -i $MULTIQC_JOB -o $MULTIQC_JOB ." | qsub -V -l select=1:ncpus=12:mem=4GB,walltime=2:00:00 -N ${MULTIQC_JOB:0:11}  # 5248672.pbsserver

# collect trimming summary
grep "Result:" ${RUN}.log | gawk 'NR%2==0' | gawk 'BEGIN{per=0}{per+=gensub(/.([0-9]+\.[0-9]+).+/, "\\1", "1", $7); avg=per/NR}END{printf("Average percentage of bases kept after trimming: %.2f%% \n",avg)}' > $RUN.bbduk.stats
# Average percentage of bases kept after trimming: 97.76%
# Number of files processed: 72
grep "Input:" ${RUN}.log | gawk 'NR%2==0' | gawk 'BEGIN{reads=0};{reads+=$2; avg=reads/NR}END{printf("Average number of reads per file: %.2f \nTotal number of reads: %d\nNumber of files processed: %d\n",avg, reads,NR)}' >> $RUN.bbduk.stats
# Average number of reads per file: 4243748.83
# Total number of reads: 305549916
# find and remove empty files
find . -size 0 -exec rm {} + 
# Check that all jobs finished successfully
find . -regextype posix-egrep -regex '\./.*\.e[0-9]+.*' | xargs grep "ExitStatus" #  *m.e$JOB_ID.*
# remove temp files
rm raw.sam raw_R*.fq.gz
# Done!

Calling variants (using Freebayes)

PLOIDY=1
DATE=`date +%d_%m_%Y`
RUN=FB_all_${DATE} # day of run was 02_02_2019
mkdir -p $WORK_DIR/${RUN}
cd !$
# JOB_NAME="BBmap_me14_vars"
GENOME="$HOME/scratch/data/reference_genomes/Ascochyta_reference_genomes/A.rabiei_me14"
NCORES=8


# Distributed freebayes (each node runs freebayes-parallel on one contig)
# download script
wget https://raw.githubusercontent.com/freebayes/freebayes/master/scripts/split_ref_by_bai_datasize.py ~/bin/
chmod +x ~/bin/split_ref_by_bai_datasize.py
conda install -n $CONDA_NAME numpy scipy
# split each contig/chromosome to smaller 1e6 bits
~/bin/split_ref_by_bai_datasize.py -s 4e5 -r $GENOME.fa.fai $RUN_DIR/AR0321.csorted.bam > $RUN_DIR/ArME14_target_regions_chr.bed
ln -s $RUN_DIR/ArME14_target_regions_chr.tsv $RUN_DIR/ArME14_target_regions_chr.bed
# prepare commands
BAM_FILES=$( find $RUN_DIR -maxdepth 1 -name "*.csorted.bam" -size +1M  | gawk -vORS=" " '1' )
cut -f1 $GENOME.fa.fai | parallel --dry-run "freebayes-parallel <(grep '{}' ArME14_target_regions_chr.tsv | gawk '{printf \"%s:%s-%s\n\", \$1, \$2, \$3}') \$NCPUS  -f $GENOME.fa -p1 $BAM_FILES > {}.combined.vcf" > freebayes.cmds

cd $RUN_DIR/pbs_logs
# send to the cluster
FREEBAYES=$(qsub -J1-$(cat $RUN_DIR/freebayes.cmds | wc -l) -l select=1:ncpus=8:mem=64GB,walltime=24:00:00 -N freebayes -v "CMDS_FILE=$RUN_DIR/freebayes.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")

# check status of jobs
seq 1 $(cat $RUN_DIR/freebayes.cmds | wc -l) | parallel "qstat -fx $FREEBAYES[{}] | head"
# repeat failed jobs
FREEBAYES_FAILED=$(qsub -J7-17:10 -l select=1:ncpus=8:mem=64GB,walltime=36:00:00 -N freebayes -v "CMDS_FILE=$RUN_DIR/freebayes.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")

# merge ind files to a single bam file (keep submission ids)
BAMS=$( find $RUN_DIR -maxdepth 1 -name "*.csorted.bam" -size +1M  | gawk -v ORS=" " '{print "I="$1}' )
# BAM_FILES=$( find $RUN_DIR -maxdepth 1 -name "*.csorted.bam" -size +1M  | gawk -vORS=" " '1' )
# cd $RUN_DIR/pbs_logs
# merge with sambamba (memory allocation issues)
# FINAL_MERGE=$( echo "source ~/.bashrc; conda activate $CONDA_NAME; cd $RUN_DIR; set -Eeo pipefail; sambamba merge -t $[NCPUS - 2] All_samples.csorted.combined.bam $BAM_FILES; sambamba depth base --combined All_samples.csorted.combined.bam | cut -f 1-3 | tail -n +2 | coverage_to_regions.py $GENOME.fa.fai 50 > ArME14_targets.regions " | qsub -V -l select=1:ncpus=${NCORES}:mem=32GB,walltime=2:00:00 -N Final_merge | egrep -o "^[0-9]+") # 5248672.pbsserver
# merge with picard (not needed)
FINAL_MERGE=$( echo "source ~/.bashrc; conda activate $CONDA_NAME; cd $RUN_DIR; set -Eeo pipefail; picard MergeSamFiles CREATE_INDEX=true R=$GENOME.fa SO=coordinate $BAMS O=All_samples.csorted.combined.bam" | qsub -V -l select=1:ncpus=${NCORES}:mem=32GB,walltime=3:00:00 -N Final_merge | egrep -o "^[0-9]+")

# calculate coverage
ln -s $RUN_DIR/All_samples.csorted.combined.bai $RUN_DIR/All_samples.csorted.combined.bam.bai 
# conda deactivate
~/bin/split_ref_by_bai_datasize.py -s 100e6 -r $GENOME.fa.fai $RUN_DIR/All_samples.csorted.combined.bam > $RUN_DIR/All_ArME14_targets_regions.bed
# run freebayes on combined BAM (stringent)
cut -f1 $GENOME.fa.fai | parallel --dry-run "freebayes-parallel <(grep '{}' $RUN_DIR/All_ArME14_targets_regions.bed | gawk '{printf \"%s:%s-%s\n\", \$1, \$2, \$3}') \$NCPUS  -f $GENOME.fa -p1 -g 100000 -0 -C 5 $RUN_DIR/All_samples.csorted.combined.bam > $RUN_DIR/All_samples.{}.combined.stringent.vcf" > $RUN_DIR/freebayes_stringent.cmds
# run freebayes on combined BAM
FREEBAYES_STRING=$(qsub -J1-$(cat $RUN_DIR/freebayes_stringent.cmds | wc -l) -l select=1:ncpus=8:mem=64GB,walltime=36:00:00 -N freebayes_string -v "CMDS_FILE=$RUN_DIR/freebayes_stringent.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# save read groups to file
sambamba view -H All_samples.csorted.combined.bam | grep '^@RG' > read_groups.txt


# filter the files and combine to a single VCF file, 
CONCAT_FB=$(echo "set -Eeo pipefail; source ~/.bashrc; conda activate $CONDA_NAME; cd $RUN_DIR ; cat $RUN_DIR/All_samples.*.combined.stringent.vcf | vcffirstheader | vcfstreamsort -w 1000 | vcfuniq > A_rabiei_2018_2020.bwa.fb.stringent.vcf" | qsub -V -l select=1:ncpus=8:mem=16GB,walltime=20:00:00 -N FB_concat | egrep -o "[0-9]+" )

MIN_DP=5
ls -1 ArME14_*.combined.vcf  | parallel --dry-run "printf \"{}\t%s\t%s\t%s\t%s\n\" \$(cat {} | grep -c -v '^#') \$(cat {} | SnpSift filter \"( GEN[?].DP > $MIN_DP ) & ( GEN[?].GT != './.' )\" | gawk '{if (\$0 ~ /^#/ || (length(\$4)==1 && length(\$5)==1)); print \$0}' | grep -c -v '^#') \$(cat {} | SnpSift filter \"( GEN[?].DP > $MIN_DP ) & ( GEN[?].GT != './.' ) & ( QUAL > 20 )\" | gawk '{if (\$0 ~ /^#/ || (length(\$4)==1 && length(\$5)==1)); print \$0}' | grep -c -v '^#') \$(cat {} | SnpSift filter \"( GEN[?].DP > $MIN_DP ) & ( GEN[?].GT != './.' ) & ( QUAL > 30 )\" | gawk '{if (\$0 ~ /^#/ || (length(\$4)==1 && length(\$5)==1)); print \$0}' | grep -c -v '^#') > {}.stats" > $RUN_DIR/vcf_stats.cmds

cd $RUN_DIR/pbs_logs

# Calculate contig stats
VCF_STATS=$(qsub -J1-$(cat $RUN_DIR/vcf_stats.cmds | wc -l) -l select=1:ncpus=8:mem=16GB,walltime=2:00:00 -N vcf_stats -v"CMDS_FILE=$RUN_DIR/vcf_stats.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")


cat <(printf "file\ttotal_snps\tDP${MIN_DP}_filtered_snps\tQUAL20_filtered_snps\tQUAL30_filtered_snps\n") <(cat $RUN_DIR/*stringent.vcf.stats) > vcf_stats_stringent_$DATE.txt



# find and remove empty files
find . -size 0 -exec rm {} +

# check jobs completion
find . -regextype posix-egrep -regex '\./.*\.e[0-9]{7}.*' | xargs grep "ExitStatus"

# check for success/failure of FB commands
tail -n +2 $FB.cmds.p*.* | cut -f7,9 | gawk '$1==1' > $FB.failed.cmds
tail -n +2 $FB.cmds.p*.* | cut -f7,9 | gawk '$1==0' > $FB.successful.cmds
rm $FB.cmds.p*.*

Samples with less than 20% mapping and x15 coverage were removed from the rest of the analysis (see highlighted rows in Table 2).

Error rates were estimated by individually processing the duplicate samples and assessing the rate of mismatch variant calls within replicates of the same isolate. To process mapping table, copy the multisampleBamQcReport.html file from the server to the local analysis folder

datatable(as.data.frame(qualimap_results), # %>% mutate_if(is.numeric, ~round(., 2))),
          caption=tbls("mapping_rates")) %>% #, 
  formatRound(~Coverage+GC_percentage+Mapping_quality_mean) %>% 
  formatPercentage('Error_rate', 2) %>% 
          # options = list(dom = 'tf', pageLength = 40)) %>%
  formatStyle("Coverage", target = "row",
  backgroundColor = styleInterval(c(10,20), c('#ef3125', '#feed95', NA))
)

# pander(as.data.frame(qualimap_results), caption=tbls("mapping_rates"), justify="left")

# datatable(as.data.frame(mapping_stats), caption = tbls("mapping_rates"), # , width=10
#           style="default", rownames = FALSE, autoHideNavigation=TRUE) %>% 
#   formatPercentage('Mapping_rate', 2) %>% #formatStyle('Isolate',
#   formatStyle('Mapping_rate',
#     color = styleInterval(c(0.05, 0.5), c('yellow', 'blue', 'black')),
#     backgroundColor = styleInterval(c(0.05, 0.5), c('red', 'yellow',NA))
#   )

Population Genetics

The core SNPs were initially filtered by TASSEL [@BradburyTASSELsoftwareassociation2007] and then imported to R and were used to analyse the population structure using a range of popgen packages (adegenet, poppr, mmod and more) [@Jombartadegenet31new2011a; @KamvarPopprpackagegenetic2014; @WinterMMODlibrarycalculation2012].

The SNP data was used to calculate a distance matrix between all samples, which was then used as inpur for principle component analysis (PCA). The PCA was plotted as an interactive plotly plot to allow exploration of the data and selection of isolates for further experiments/screening based on their phenotype, collection origin and genetic makeup.

Variant Annotation

Variants were then further filtered to keep only polymorphic and bi-allelic SNPs using the VariantAnnotation v1.28.11 and GenomicRanges v1.34.0 R Bioconductor packages [@obenchain_variantannotation:_2014; @lawrence_software_2013]. The SNPs were annotated and position of each variant relative to gene locations was determined (coding/intron/promotor/intergenic), based on the reference genome and gene models of A. rabiei strain Me14 (where the contig names were shortened to remove the Arab_Me14 prefix).

cat A.rabiei_me14.fasta | sed 's/Arab_Me14_//' > A_rabiei_me14_short_names.fasta
zcat Arab_me14.gff3.gz | sed 's/Arab_Me14_//' > Arab_me14_short_names.gff

Variants in coding regions of genes were identified as Synonymous/Non-synonymous/Non-sense mutations if they were silent, changing the amino acid or the reading frame, respectively. Additional annotation of the genes at each SNP site was performed by a BLASTp search against the NCBI non-redundant protein database (nr) and scanning against the InterPro conglomerate dbs. Effector prediction of the variant-associated genes was performed by EffectorP v1.0/2.0 [@SperschneiderEffectorPpredictingfungal2016;@JanaImprovedpredictionfungal2018]. The variants were summarised and visualised across the genome scaffolds and visualised using circlize v0.4.8 R package [@gu_circlize_2014].

# Download the python wrapper

$HPC_GRID_RUNNER_DIR/BioIfx/hpc_FASTA_GridRunner.pl --cmd_template "cd $PBS_O_WORKDIR; pyenv shell miniconda3-latest; ~/etc/tools/Annotation/InterPro/iprscan5_urllib3.py --sequence=__QUERY_FILE__ --outfile=__QUERY_FILE__.interpro.tsv --outformat=tsv --goterms --pathways --email=i.bar@griffith.edu.au " --query_fasta Assoc_SNP_genes_cds_05_02_2018.fasta -G $HPC_GRID_RUNNER_DIR/hpc_conf/small_PBS_jobs.conf -N 1 -O SNP_interpro

Look for particular effector genes and genes associated with plant-pathogen interactions and pathogenicity genes. Look in the literature and create a list of target genes.

Variant-Pathogenicity association

Association between variants and pathogenicity levels was identified by SNPassoc v1.9.2 R package [@gonzalez_snpassoc:_2007], using a codominant gene model and a significance threshold of p-value \(\le0.005\).

Appendix 2. Useful resources

  • Whole-Genome Comparison of Aspergillus fumigatus Strains Serially Isolated from Patients with Aspergillosis. [@hagiwara_whole-genome_2014]:

Sequence analysis: The Illumina data sets were trimmed using fastq-mcf in ea-utils (version 1.1.2-484), i.e., sequencing adapters and sequences with low quality scores (Phred score [Q], <30) were removed (24). The data sets were mapped to the genome sequence of the A. fumigatus genome reference strain Af293 (29,420,142 bp, genome version s03-m04-r03) (25, 26) using Bowtie 2 (version 2.0.0-beta7) with the very sensitive option in end-to-end mode (27). Duplicated reads were removed using Picard (version 1.112) (http://picard.sourceforge.net). The programs mpileup and bcftools from SAMtools (version 0.1.19-44428cd) were used to perform further quality controls. In mpileup, the -q20 argument was used to trim reads with low-quality mapping, whereas the argument -q30 was used to trim low-quality bases at the 3' end (28). The bcftools setting was set to -c in order to call variants using Bayesian inference. Consensus and single nucleotide polymorphisms (SNPs) were excluded if they did not meet a minimum coverage of 5x or if the variant was present in <90% of the base calls (29, 30). The genotype field in the variant call format (VCF) files indicates homozygote and heterozygote probabilities as Phred-scaled likelihoods. SNPs were excluded if they were called as heterozygous genotypes using SAMtools. The mapping results were visualized in the Integrative Genomics Viewer (version 2.3.3) (31, 32). The reference genome data included information on open reading frames and annotations, from which the SNPs were designated non-synonymous or synonymous.
Single nucleotide mutations were confirmed by Sanger sequencing. Regions of approximately 400 bp that contained a mutation were amplified with appropriately designed primer pairs and then sequenced. The primer sequences are listed in Table S1 in the supplemental material, which were named as follows. For verification of the SNPs in strains from patient I or patient II, PaI or PaII was added to the primer name, respectively. For non-synonymous SNPs, synonymous SNPs, or SNPs in a non-coding region, (NS, Syno, NonC) was added to the primer name, respectively.
Analysis of unmapped reads: De novo assembly of the unmapped reads was conducted using the Newbler assembler 2.9 (Roche), with default parameters. The contigs were selected based on size/depth criteria: those of <500 bp and/or with a depth of <30x coverage were removed. To investigate whether unique genome sequences were present in strains isolated from the same patient, the unmapped reads of each strain were mapped to the contigs generated from all the strains in the same patient by the Bowtie 2 software. The coverage of the mapped regions was then evaluated. Gene predictions were performed using the gene prediction tool AUGUSTUS (version 2.5.5), with a training set of A. fumigatus (33). The parameters of AUGUSTUS were -species = aspergillus_fumigatus, -strand = both, -genemodel = partial, -singlestrand = false, -protein = on, -introns = on, -start = on, -stop = on, -cds = on, and -gff3 = on. To compare all the predicted genes with Aspergillus genes, consisting of 244,811 genes available on AspGD (34), a reciprocal BLAST best hit approach was performed by BLASTp (35), with an E value of 1.0e-4. All BLASTp results were filtered based on a BLASTp identity of \(\ge 80\)% and an aligned length coverage of \(\ge 80\)%.

General information

This document was last updated at 2022-05-23 12:26:01 using R Markdown (built with R version 3.5.2 (2018-12-20)). Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. It is especially powerful at authoring documents and reports which include code and can execute code and use the results in the output. For more details on using R Markdown see http://rmarkdown.rstudio.com and Rmarkdown cheatsheet.


Bibliography